Soochow University Word Segmenter for SIGHAN 2012 Bakeoff

Authors

  • Yan Fang
  • Zhongqing Wang
  • Shoushan Li
  • Zhongguo Li
  • Richen Xu
  • Leixin Cai
Abstract

This paper presents a Chinese word segmentation system for MicroBlog corpora, built for the CIPS-SIGHAN Word Segmentation Bakeoff 2012. The system uses Conditional Random Fields (CRF) as the segmentation model. To adapt the model to MicroBlog text, we manually analyze and annotate many MicroBlog messages and, based on that analysis, propose several pre-processing and post-processing rules to improve performance. As a result, our system obtains a competitive F-score in comparison with other participating systems.
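
The abstract does not spell out the rules or the tag set, so the following is only a minimal sketch of how rule-based pre-processing can be wrapped around a character tagger. Everything named here is an assumption for illustration: the `ATOMIC` pattern (URLs, @mentions and #hashtags# kept as single tokens), the B/M/E/S tag scheme, and the `tagger` callable, which stands in for a trained CRF model and is not the authors' implementation.

```python
import re

# Hypothetical rule: URLs, @mentions and #hashtags# are kept as atomic tokens.
ATOMIC = re.compile(r"https?://[A-Za-z0-9./?=&_%#-]+|@[\w-]+|#[^#\s]+#")


def split_atomic(text):
    """Pre-processing: split text into (chunk, is_atomic) pairs."""
    pieces, pos = [], 0
    for m in ATOMIC.finditer(text):
        if m.start() > pos:
            pieces.append((text[pos:m.start()], False))
        pieces.append((m.group(), True))
        pos = m.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces


def tags_to_words(chars, tags):
    """Merge characters into words; a word ends at an E or S tag (assumed scheme)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # flush a trailing word the tagger left unclosed
        words.append(current)
    return words


def segment(text, tagger):
    """Segment text; `tagger` maps a string to one tag per character."""
    words = []
    for chunk, is_atomic in split_atomic(text):
        if is_atomic:
            words.append(chunk)  # protected by the rule, never sent to the tagger
        else:
            for part in chunk.split():
                words.extend(tags_to_words(part, tagger(part)))
    return words
```

A trained CRF would be passed in as `tagger`; for a quick check, a placeholder such as `lambda s: ["S"] * len(s)` treats every character as a single-character word while still leaving URLs and mentions intact.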

Similar resources

Nanjing Normal University Segmenter for the Fourth SIGHAN Bakeoff

This paper expounds a Chinese word segmentation system built for the Fourth SIGHAN Bakeoff. The system participates in six tracks, namely the CityU Closed, CKIP Closed, CTB Closed, CTB Open, SXU Closed and SXU Open tracks. The model of Conditional Random Field is used as a basic approach in the system, with attention focused on the construction of feature templates and Chinese character categor...

A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005

We present a Chinese word segmentation system submitted to the closed track of Sighan bakeoff 2005. Our segmenter was built using a conditional random field sequence model that provides a framework to use a large number of linguistic features such as character identity, morphological and character reduplication features. Because our morphological features were extracted from the training corpor...

Towards a Hybrid Model for Chinese Word Segmentation

This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...

An Improved Chinese Word Segmentation System with Conditional Random Field

In this paper, we describe a Chinese word segmentation system that we developed for the Third SIGHAN Chinese Language Processing Bakeoff (Bakeoff2006). We took part in six tracks, namely the closed and open track on three corpora, Academia Sinica (CKIP), City University of Hong Kong (CityU), and University of Pennsylvania/University of Colorado (UPUC). Based on a conditional random field based ...

Using Part-of-Speech Reranking to Improve Chinese Word Segmentation

Chinese word segmentation and Part-of-Speech (POS) tagging have been commonly considered as two separate tasks. In this paper, we present a system that performs Chinese word segmentation and POS tagging simultaneously. We train a segmenter and a tagger model separately based on linear-chain Conditional Random Fields (CRF), using lexical, morphological and semantic features. We propose an approx...

Publication date: 2012